An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models

Structure-based virtual screening (SBVS) is a key workflow in computational drug discovery. SBVS models are assessed by measuring the enrichment of known active molecules over decoys in retrospective screens. However, the standard formula for enrichment cannot estimate model performance on very large libraries. Additionally, current screening benchmarks cannot easily be used with machine learning (ML) models due to data leakage. We propose an improved formula for calculating VS enrichment and introduce the BayesBind benchmarking set composed of protein targets that are structurally dissimilar to those in the BigBind training set. We assess current models on this benchmark and find that none perform appreciably better than a KNN baseline. We publicly release the BayesBind benchmark at https://github.com/molecularmodelinglab/bigbind.


Introduction
Structure-based virtual screening (SBVS) aims to identify compounds that bind to a protein target.Given the 3D structure of the target's binding site, a large library of compounds is scored according to their predicted ability to bind to a protein target, and the top-scoring molecules are selected for experimental validation.
Running a prospective virtual screen is time-consuming and expensive.As such, we desire to validate SBVS models virtually before using them in production.An ideal in silico benchmark would allow us to 1) select the best model out of a set of candidates and 2) give us some indication of how successful this model will perform in a real-life setting.
The current strategy to validate SBVS models is to create separate benchmarking sets for many protein targets.For each target, compounds that are known to bind to the protein ("actives") are combined with either computationally identified decoys (DUD-E [1] and CASF [2]) or experimentally validated inactives (LIT-PCBA [3]).Models are then assessed according to their ability to discriminate actives from inactives.While traditional classification metrics such as ROC-AUC can be used to assess a model's classification performance, they are not the primary metrics we care about [4].Instead, we are mainly concerned with whether or not the top-scoring molecules are active or inactive.
The simplest and most interpretable early enrichment metric is the enrichment factor (EF), parameterized by a selection fraction χ (e.g.1%).EF χ is defined to be the fraction of actives selected in the top χ of the compounds divided by the overall fraction of actives in the set.This metric is easily interpreted as the success rate of the model relative to the expected success rate of random selection.
A fundamental issue with the enrichment factor is that the maximum value achievable is the ratio of inactive to active compounds in the set.For DUD-E, the average decoy to active ratio over all the targets is 61.Libraries in real-life screens, however, have much higher inactive-to-active ratios; therefore, models must achieve higher enrichments to be useful.Consider that high-throughput screens (HTS) regularly test hundreds of thousands of random compounds against protein targets, while virtual screens typically suggest hundreds for experimental validation.Thus, for virtual screens to produce results similar to HTS, models must achieve enrichments of around 1,000 (for sufficiently low χ).For the EF formula to measure such high enrichments, we would need benchmarks with extremely high inactive-to-active ratios.Such a benchmark would need to be very large, thus making it prohibitively time-consuming to test models on it.This benchmark would additionally pose problems with using decoys as inactives; with an enormous number of decoys, many of them would likely bind to the protein in real life simply by random chance.
In this work, we propose a new way of calculating enrichments in VS benchmarks.Our improved formula is just as simple to calculate as the original EF and does not suffer from the difficulties outlined above.Notably, the new formula does not assume that decoys in the set are truly inactive.We denote this new metric the Bayes enrichment factor, or EF B , to distinguish it from the traditional formula.
Additionally, we created a new SBVS benchmark, BayesBind, especially for use with machine learning (ML) models.It is currently difficult to evaluate ML models on current VS benchmarking sets due to issues of data leakage.Properly splitting protein-ligand activity datasets is difficult and it is very easy to achieve good performance due to the similarity between the train and test sets [5][6][7].The targets in the BayesBind benchmark are taken from the validation and test sets of the BigBind dataset [8], allowing easy validation of any model that was trained on the BigBind training set.Even with the rigorous splitting used by BigBind, we removed several targets from the set for which a K-nearest-neighbor (KNN) model did suspiciously well on.

The Bayes enrichment factor
Suppose we are running a virtual screen on N molecules for a particular protein target.Let A be the event that a particular molecule binds to the target, and let S be the score of that molecule given by our SBVS program.χ is our selection fraction, so we choose the cutoff score S χ such that P (S > S χ ) = χ.We select all molecules with scores above S χ .We define the "true" enrichment of our model at this χ to be: The usual formula for the enrichment factor EF χ is an estimate of this true enrichment, where, for example, P (A|S > S χ ) is estimated by the empirical fraction of actives in our chosen set.
Using Bayes' Theorem, it is easy to show that EF T χ is also equal to P (S>Sχ|A) P (S>Sχ) .This ratio can easily be estimated by separately scoring a set of active molecules and a set of random compounds and observing how their score distributions differ.Thus we define our new metric, the Bayes enrichment factor, to be: Fraction of actives whose score is above S χ Fraction of random molecules whose score is above S χ The EF B has several advantages.Most notably, it requires only random compounds (from the same chemical space as the actives) instead of inactives.This is a major improvement considering how much work has gone into producing decoys that are hypothesized not to bind [1].This both eliminates a potential source of error in decoy-based benchmarks and makes coming up with benchmarking sets easier.
Additionally, the EF B has no dependence on the ratios of actives to random in the set, avoiding the problem where the EF maxes out at the overall ratio of actives.Instead, the EF B achieves its maximum value at 1 χ (the same maximum value achievable by the true enrichment).The minimum value of χ we could use the EF B on is 1 N R , where N R is the number of random compounds (this would correspond to S χ being the maximum score on the random set.This is a much more efficient use of the data than the EF, which requires N S χ total compounds to measure enrichment at χ, where N S is the number of selected compounds (if N S is too small, the EF becomes noisy).
We additionally note that, as naïve ratio estimators, both the EF and EF B formulae are biased estimators of the true enrichment EF T .This is an undesirable property, though we leave the creation of an unbiased estimator as future work.For now, we simply recommend paying close attention to the computed confidence intervals around these metrics.

The maximum Bayes enrichment factor
The EF B allows us to estimate enrichments for χ as low as 1 N R , where N R is the number of random compounds in the set.This raises an important question: what value of χ should we measure enrichment at?As can be seen in Figure 1, values at extremely low χ often have large confidence intervals and sometimes become 0 due to the lack of any known actives with scores above S χ .
We propose simply taking the maximum value of the EF B achieved over the measurable χ interval of [ 1 N R , 1], which we denote the EF B max .If we assume that the true enrichment EF T χ increases monotonically as χ → 0 (that is, that the probability of success in a virtual screen increases with the size of the screen), then the EF B max is our best guess at how well the model will do in a real-life screen.We do note that, because the EF B max usually occurs at very low χ, its confidence interval is often very wide.It is unfortunate that we cannot pin down the true value with great certainty, but valuable information can still be gleaned.We specifically suggest using the lower confidence bound as an informative metric.This tells us, with high confidence, that our model will at least this well in a prospective virtual screen for the current target.

Metric comparison on DUD-E
In order to compare EF and EF B metrics, we analyzed the performance of several models on the DUD-E benchmark.These models had been previously benchmarked in Sunseri and Koes [9]; we used the data from their study and re-analyzed it using both the EF and EF B .We report the median metrics across all DUD-E targets in Table 1.
From this table, we can make a couple of observations.First, the EF and EF B agree well at χ = 1% (the most common EF value reported).At χ = 0.1%, however, the EF B results are much higher.This is due to the problem outlined above wherein the EF maxes out at the decoy to actives ratio.The EF B suffers no such drawbacks, and can thus be used to more accurately assess model performance at low χ.The fact that the EF B requires random compounds rather than true inactives enables much simpler benchmark creation.We sought to capitalize on this by creating the BayesBind benchmark, which we also explicitly designed for easier benchmarking of ML-based models.
The BayesBind benchmark is composed of validation and test sets whose targets are taken from the validation and test sets of the BigBind set, respectively.Molecules with known activity for each target Figure 1: EF B and EF (along with 95% confidence intervals) for various χ for randomly selected models and targets on DUD-E (data from Sunseri and Koes [9]).Notice that the EF B can be significantly higher than the EF at low χ due to the maxing-out problem of the EF.Both estimates can converge to 0 at sufficiently low χ due to both metrics failing to select any active molecules above the current threshold.
were clustered according to a Tanimoto similarity cutoff of 0.4 on 2048-bit Morgan fingerprints with radius 3. Targets were chosen with at least 50 clusters containing molecules with a pChEMBL value greater than 5 (10 µM).To limit the redundancy in the set, we chose a single molecule from each cluster (the molecule with median pChEMBL value).We saved all such molecules in our "actives" set, even those below 10 µM; this is so that users can choose different activity cutoff thresholds when analyzing their results.
To ensure the diversity of the targets of the benchmark, only one target was chosen for every pocket cluster (as defined by the pocket-level TM-score clusters used to split the BigBind set).We chose the target in each pocket cluster that would yield the highest number of actives for the benchmark.
For each target, a set of 1,000 random compounds was generated.These compounds were chosen from the set of compounds in the BigBind validation and test sets not known to bind to any pockets in the same pocket cluster as the current target.The compounds were clustered according to the same Tanimoto cutoff of 0.4, and a random compound was chosen from each cluster.We note that this selection of the random set guarantees that the active and random molecules come from the same region of chemical space.This is not the case in DUD-E, where decoys are generated from ZINC [10] and actives from ChEMBL [11].Even with the property matching employed by DUD-E, subtle differences still exist between the sets that can be exploited by ML models [5,12].
The final BayesBind set contains 14 targets in the validation set and 11 in the test set.There are an average of 297 known activities for each target, and an average of 187 actives with activity less than 10 µM.

The BayesBind ML subset
We desire to use the BayesBind set to rigorously benchmark ML models trained on the BigBind training set.Therefore, we desire a simple baseline with which to compare the performance of more complex models.To validate the splits of the BigBind set, we used a K-nearest-neighbors (KNN) classifier based on the ligand and pocket similarities of protein-ligand pairs in the train set [8].The KNN did no better than random on the BigBind test set; we used this as evidence that our splitting was rigorous.
We used this same KNN to validate the BayesBind benchmarking set.We were surprised to discover that the KNN significantly outperformed all other models, including a neural network model (BANANA) that was also trained on BigBind [8].On some targets, it was able to achieve enormous enrichment values (EF B max up to 233).The median EF B max it achieved on the full BayesBind set was 34 [21, 51] on the validation set and 11 [4.8, 21] on the test set.
How do we reconcile the KNN's performance on the BayesBind set with its poor classification performance on the BigBind test set overall?We argue that the BigBind splits were mostly successful at placing too-similar pockets in the same split, but had some noise.The BayesBind set is only composed of targets with many known active molecules, and therefore are likely to be in well-studied protein families.Thus there are likely to be many similar protein-ligand pairs across the BigBind set as a whole; because of the noise on the similarity metrics used, some of these were placed in the training set.It only takes a single similar datapoint in the training set to massively increase the performance of the KNN.
The performance of the KNN on this set is concerning and more research should be done on even more rigorous data splits.For the time being, however, we decided to create an ML subset of the BayesBind set composed only of targets that the KNN did relatively poorly on (EF B max less than 30).While the KNN still achieves nonrandom enrichment on many targets in this subset (and in fact still outperforms all the other models), there is still much room for improvement.A compelling ML model should be able to significantly outperform the KNN on this benchmark.
The results for the three models not trained on BigBind (Glide, Vina, and GNINA) are shown in figure 2, and the median results across all targets are shown in Table 2. Similarly, the results for all the models (including BANANA and the KNN) are shown in Figure 3 and Table 3.
Overall, we see a wide range of model performances; most models (except for the KNN) perform similarly to random for most targets, but some are able to achieve notable enrichment.We note, however, that even the highest EF B max achieved (50.[3.5, 100] by Vina on NR1H2), is significantly less than the enrichments observed by models on DUD-E (see Table 1 and Figure 1).Indeed, it is much less than the target enrichment we desire for virtual screening to be competitive with HTS (1,000).

Conclusion
We have shown that the traditional formula for estimating virtual screening performance, the enrichment factor, fails to estimate the true enrichment of a model in a real-life setting (very low active-to-inactive ratio and low selection fraction χ).We propose a new formula, the EF B , that is just as easy to calculate but is able to better estimate enrichment in realistic scenarios.Notably, this new formula only requires a set of known active molecules and a set of random molecules.We also suggest the use of the maximum Bayes enrichment factor (EF B max ), and especially its lower confidence bound, as a key summary metric for a model on a VS benchmark.We also developed a new benchmarking set, BayesBind, that pairs known actives for targets from the BigBind validation and test sets with random compounds.This set enables easy benchmarking of ML models that are trained on BigBind.We evaluated three common VS models on this benchmark, along with KNN and a neural network model (BANANA).
Our analysis shows that the VS models studied generally fail to achieve nonrandom enrichment on the targets in the set.This failure stands in contrast to the reported success of these methods on decoy-based benchmarks such as DUD-E.We hypothesize two possible reasons for this.First, the "active" modules for DUD-E targets tend to be tighter binders than those in the BayesBind set; DUD-E selects the lowest-affinity ligands in clusters and has reduced affinity cutoff values (down to 3 nM) for targets with many known actives.BayesBind, in contrast, takes the median-affinity ligand from each cluster.All our analysis here uses a 10 µM activity cutoff.We argue that this higher cutoff is much more realistic; nanomolar binders are vanishingly rare and almost never show up in primary screens.For now, we should pursue models that can reliably discover low µM binders.
Second, the choice of decoys in DUD-E could have a large effect on model performance.DUD-E chooses decoys from ZINC and its actives from ChEMBL; even though molecular properties are matched across the sets, there are still subtle differences that, for instance, ML models can abuse [12].BayesBind, by contrast, uses random compounds (not decoys) that are taken from the same chemical space (BigBind compounds, which ultimately come from ChEMBL).Notably, all the compounds in BigBind are annotated with activity against at least one target; the set could be enriched in nonspecific protein binders.If this is the case, the BayesBind set is also implicitly testing the model's ability to find selective compounds to some extent.This is a harder task, and an important one; real-life drug discovery projects are generally only interested in selective binders.
Overall, the BayesBind set is an important new benchmark for estimating VS model performance in a real-life scenario.The fact that no model does well on it stands as a testament to the difficulty of prospective virtual screening.We hope that our comrades will use this benchmark to rise to the challenge of making new methods succeed in this area.

Figure 2 :
Figure 2: Left: EF B max for Glide, Vina, and GNINA on each target in the full BayesBind set.Right: the EF B max 95% CI lower bounds for the same models and targets.On both plots, the dotted line indicates no enrichment (EF B of 1).

Figure 3 :
Figure 3: Left: EF Bmax for all the models (including BANANA and the KNN) on each target in the ML subset.Right: the EF B max 95% CI lower bounds for the same models and targets.On both plots, the dotted line indicates no enrichment (EF B of 1).

Figure A. 4 :
Figure A.4: Left: EF B max for every model (including the ML models KNN and BANANA) on each target in the full BayesBind set.Right: the EF B max lower bounds for the same models and targets.On both plots, the dotted line indicates no enrichment (EF B of 1).

Table 1 :
[9]parison of EF and EF B metrics for models on the DUD-E benchmarking set.Results were taken from Sunseri and Koes[9].

Table 2 :
Median AUCs, EF B 1% , and EF B max results for Glide, GNINA, and Vina on the full BayesBind set.95% confidence intervals are shown in brackets.

Table 3 :
Median AUCs, EF B 1% , and EF B max results for all models (including BANANA and the KNN) on the BayesBind ML subset.95% confidence intervals are shown in brackets.